Shortcut input/output in virtual machine systems

ABSTRACT

Read requests to a commonly accessed storage volume are conditionally issued, depending on whether or not a requested data block is already stored in memory from a prior access or to be stored in memory upon completion of a pending request. A data structure is maintained in memory to track physical memory pages and to indicate for each physical memory page the corresponding location in the storage volume from which the contents of the physical memory were read and the number of virtual memory pages that are mapped thereto.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Patent Application Ser. No. 61/179,612, filed May 19, 2009. The entire content of the provisional application is incorporated by reference herein.

BACKGROUND

A virtual machine (VM) provides an environment in which an operating system (OS) may execute with apparent control of a dedicated physical machine. Multiple VMs may execute on a common hardware machine, and each VM may operate with protection and isolation from other VMs executing on the common hardware machine. Each VM typically encapsulates a complete executing state for a corresponding OS, including both user-level applications and kernel-mode OS services.

A VM may execute an instance of a conventional desktop environment commonly found on a desktop computer. A virtual desktop infrastructure (VDI) system is a system of VMs specifically configured to execute instances of a desktop environment on the common hardware machine. A client device may access a specific instance of the desktop environment within the VDI system via a network connection to the common hardware machine. The client device is then able to present the desktop environment to a user from any location with appropriate network access.

Each instance of the desktop environment comprises a desktop OS and related applications. The desktop OS and applications reside in a mass storage volume associated with the instance. Each instance is associated with a unique mass storage volume that provides a private storage space similar to that provided by a hard disk attached to a desktop computer. The mass storage volume may be derived from a shared, read-only “base disk” and a unique set of difference blocks associated with the mass storage volume.

The mass storage volumes of different instances of the desktop environment reside on a mass storage device. The mass storage device is conventionally implemented using one or more magnetic hard disk drives. However, any form of mass storage media may be used. For example, in modern computer systems, the mass storage media may comprise a solid-state drive (SSD) or an array of SSDs.

When a plurality of desktop environments is started by the VDI system, each desktop environment individually boots a respective instance of the OS on an apparently private hardware machine, provided by an associated VM. In actuality, however, the common hardware machine is performing all necessary steps and instructions to boot each of the plurality of desktop environments. A “boot storm” refers to a sharp and sometimes crippling rise in resource utilization with respect to the common hardware machine that occurs when the plurality of desktop environments attempt to simultaneously boot within their respective VMs. The boot storm is typically characterized by a sharp rise in input/output (I/O) requests, disk access requests, and memory allocation requests. When each desktop environment boots, a specific set of commonly used disk blocks is read from the mass storage device into system memory within the common hardware machine. Many of the commonly used disk blocks are data segments of the OS. In a conventional VDI system, each disk block may be read from the mass storage device and then stored in physical memory pages of the system memory on the common hardware machine. Each physical memory page is then privately mapped to a VM that requested the disk block. As a result, duplicate copies of each commonly used block may be requested from the mass storage system and stored redundantly in system memory, leading to I/O and storage inefficiencies.

For example, in a conventional VDI system, if N instances of the desktop environment are booted on the common hardware machine, then N copies of each block from the set of commonly used blocks are separately requested from the mass storage system and initially stored as N separate copies within the system memory. Similarly, if M different users launch a particular common application from their respective virtual desktop environment, then M separate requests for each block used by the application are separately transmitted to the mass storage system and related data is stored in M separate blocks in system memory. In the boot storm scenario, as well as the common application launch scenario, significant memory and I/O capacity is utilized to support multiple VMs booting and executing multiple instances of the desktop environment. Unfortunately, much of this memory and I/O capacity is utilized redundantly, and therefore limits advantages otherwise gained by a VDI systems architecture.

Therefore, what is needed in the art is a technique for reducing system resource utilization in VDI and other similar systems.

SUMMARY

One or more embodiments of the present invention provide methods and systems for conditionally issuing read requests to a commonly accessed storage volume, depending on whether or not a requested data block is already stored in memory from a prior access or to be stored in memory upon completion of a pending request. As a result, the total number of input/output requests issued to a storage volume can be significantly reduced, especially in a VDI system where a large number of reads made by different virtual machines supporting the VDI system are to the same address in the storage volume.

A method of processing read I/O requests in a computer system having applications running therein, according to an embodiment of the invention, employs a tracking data structure for a first group of machine memory pages, each having at least two virtual memory pages mapped thereto, and a second group of machine memory pages, each having only one virtual memory page mapped thereto, wherein the tracking data structure indicates, for each of the machine memory pages, a corresponding location in a storage volume from which its contents were read or are being read. According to this method, in response to a read request, the tracking data structure is used to determine that a machine memory page in the first group or the second group contains or will contain data stored in a location of the storage volume indicated in the read request, and a virtual memory page associated with the read request is mapped to this machine memory page. Further, in response to a memory write request to a machine memory page of either the first group or the second group, the contents of the associated machine memory page are copied to a new machine memory page, and a virtual memory page associated with the memory write request is mapped to the new machine memory page.

A method of processing read I/O requests in a computer system having applications running therein, according to another embodiment of the invention, employs a tracking data structure for a first group of machine memory pages, each having at least two virtual memory pages mapped thereto, a second group of machine memory pages, each having only one virtual memory page mapped thereto, and a third group of machine memory pages, each having no virtual memory pages mapped thereto, wherein the tracking data structure indicates, for each of the machine memory pages, a corresponding location in a storage volume from which its contents were read or are being read, and a reference count indicating how many virtual memory pages are mapped to the machine memory page. According to this method, in response to a read request, a virtual memory page associated with the read request is mapped to a machine memory page in one of the first, second, and third groups, and the reference count associated with the machine memory page is incremented.

A method of processing read I/O requests in a computer system having virtual machines running therein, according to yet another embodiment of the invention, employs a tracking data structure for a set of machine memory pages, wherein the tracking data structure indicates, for each of the machine memory pages, a corresponding location in a storage volume from which its contents were read or are being read, a reference count indicating how many virtual memory pages are mapped to the machine memory page, and a pending status flag indicating whether or not a read from the storage volume is pending. This method includes the steps of, in response to a read request, determining using the tracking data structure if one of the machine memory pages being tracked contains or will contain data stored in a location of the storage volume indicated in the read request, and if there is a machine memory page being tracked that contains or will contain data stored in a location of the storage volume indicated in the read request, mapping a guest physical memory page associated with the read request to the machine memory page, and if there is no machine memory page being tracked that contains or will contain data stored in a location of the storage volume indicated in the read request, issuing a request to the storage volume for the data stored in the location of the storage volume indicated in the read request, mapping a guest physical memory page associated with the read request to the machine memory page in which the requested data will be stored, and marking the guest physical memory page associated with the read request with a page sharing hint.

A computer system according to an embodiment of the invention includes a host platform for virtual machines and a storage volume connected thereto. The host platform for the virtual machines includes one or more processors and system memory having stored therein a tracking data structure for a first group of machine memory pages, each having at least two virtual memory pages mapped thereto, a second group of machine memory pages, each having only one virtual memory page mapped thereto, and a third group of machine memory pages, each having no virtual memory pages mapped thereto. The tracking data structure indicates, for each of the machine memory pages, a corresponding location in a storage volume from which its contents were read or are being read, a reference count indicating how many virtual memory pages are mapped to the machine memory page, and a pending status flag indicating whether or not a read from the storage volume is pending. A read request issued by any of the virtual machines is conditionally issued to the storage volume based on this tracking data structure.

Other embodiments include, without limitation, a computer-readable storage medium that includes instructions that enable a processing unit to implement one or more aspects of the disclosed methods.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the present invention.

FIG. 2 illustrates a base volume and linked clones of the base volume, according to one embodiment of the invention.

FIG. 3A illustrates a memory mapping from a guest virtual page number (GVPN) within a virtual machine (VM) to a machine page number (MPN) via a guest physical page number (GPPN), according to one embodiment of the invention.

FIG. 3B illustrates a second memory mapping from a second GVPN within a second VM to a second MPN via a second GPPN, according to one embodiment of the invention.

FIG. 3C illustrates a “flip” operation whereby the second GPPN is remapped to the first MPN for sharing and the second MPN is released, according to one embodiment of the invention.

FIG. 3D illustrates a copy-on-write remapping of the shared MPN to a private MPN for the second VM, according to one embodiment of the invention.

FIG. 3E illustrates the shared MPN being orphaned as a result of a copy-on-write remapping triggered by the first VM, according to one embodiment of the invention.

FIG. 4 illustrates a page tracking table configured to include at least a reference count and a pending status bit for each tracked page, according to one embodiment of the invention.

FIG. 5 is a flow diagram of method steps, performed by a hypervisor, for tracking and shortcutting input/output requests, according to one embodiment of the invention.

FIG. 6A illustrates a memory mapping from a GVPN within a virtual machine to an MPN associated with an input/output read request, according to one embodiment of the invention.

FIG. 6B illustrates a copy operation of a requested block of data associated with one MPN to another MPN, according to one embodiment of the invention.

FIG. 7 is a flow diagram of method steps, performed by a hypervisor, for tracking and shortcutting input/output requests, according to one embodiment of the invention.

DETAILED DESCRIPTION

FIG. 1 is a block diagram illustrating a computer system 100 configured to implement one or more aspects of the present invention. In one embodiment, the computer system 100 is configured to implement a virtual desktop infrastructure (VDI) system, whereby state for a plurality of different instances of the virtual desktop environment resides in machine memory 110 and processing for the different instances is performed by a processor complex 112. The computer system 100 includes processor complex 112, machine memory 110, and a mass storage system 114. The processor complex 112 may be coupled to the machine memory 110 via any technically feasible electrical interface, such as a dynamic random access memory (DRAM) interface. In other configurations, the processor complex 112 may be coupled to machine memory 110 via an intermediary interconnect, such as industry standard HyperTransport (TM), PCI-Express (TM), Intel (TM) QuickPath Interconnect (QPI), or any other technically feasible transport interface. Details regarding the intermediary interconnect are not shown herein for the sake of clarity. The processor complex 112 may be coupled to mass storage system 114 via a native storage interface, such as serial advanced technology attachment (SATA), serial attached small computer system interface (SAS), or any other technically feasible native storage interface. In other configurations, the processor complex 112 is coupled to mass storage system 114 via a network enabled storage interface such as Fibre Channel, internet small computer system interface (iSCSI), or any other technically feasible network enabled storage interface.

The processor complex 112 includes, without limitation, a memory interface 140 and one or more central processing units (CPU) 142-1 through 142-m. The memory interface 140 is configured to couple machine memory 110 to the one or more CPUs 142. Each one of the one or more CPUs 142 is configured to execute program instructions stored within machine memory 110. The program instructions are organized as software modules that may be stored for execution within machine memory 110. Each one of the one or more CPUs 142 includes a memory management unit (MMU) 141 configured to perform, without limitation, translation of addresses, such as guest physical addresses to machine addresses, or nested translations mapping guest virtual addresses to guest physical addresses to machine addresses. A disk interface 144 and a network interface 146 are coupled to the processor complex 112. The disk interface 144 is configured to couple the mass storage system 114 to the one or more CPUs 142. The disk interface 144 may include implementation specific functionality related to controlling disk systems. Such functionality may include, for example, control for redundant array of independent disks (RAID) and caching. The mass storage system 114 may comprise any technically feasible storage elements, such as magnetic disk drives, solid state drives, or any other elements configured to read and write data for persistent storage. The network interface 146 is configured to couple network port 116 to the one or more CPUs 142 within processor complex 112. The network interface may include any functionality required to transmit and receive data packets via the network port 116. In one configuration, the network port 116 is an industry standard Ethernet port.

Machine memory 110 has stored therein a kernel 134, a hypervisor 136, and virtual machines (VMs) 120, which are used by the processor complex 112 to provide instances of a desktop environment as part of the overall VDI system. The hypervisor 136 is a software virtualization layer configured to provide a runtime environment for the VMs 120 and to execute at least one virtual machine monitor (VMM) 130 that is associated with a corresponding one of the VMs 120. Each one of the VMs 120 is associated on a one-to-one basis with one of the VMMs 130. The hypervisor 136 includes a sharing module 138 that is configured to implement a page sharing mechanism for sharing pages of memory that contain identical data. Page sharing is described in greater detail below. The hypervisor 136 also includes an orphan cache 139 that is configured to retain pages of memory corresponding to blocks of disk data that may be requested in future disk block requests but otherwise are not referenced by a client process.

As shown, VM 120-1 includes one or more virtual processors 122, and guest system software 127. An application 128 may launch and execute according to a conventional run time model for a conventional user-space or kernel-space application for the guest system software 127. In one embodiment, the guest system software 127 includes a guest operating system (OS) 124, such as a commodity operating system. The guest system software 127 also includes a user interface 126, such as a window management system. The guest OS 124 is conventionally configured to provide process control, memory management, and other services required by the application 128 and to present a desktop environment to a user. The guest OS 124 includes guest drivers (DRVS) 125 configured to manage corresponding virtual devices (not shown) accessible via the virtual processor 122. The virtual devices are implemented in software to emulate corresponding system hardware components of an actual hardware processor system. The virtual devices may include, without limitation, a graphics frame buffer, a network interface, a mass storage interface, peripheral devices, and system memory. During normal execution, the application 128 generates load and store requests targeting a virtual address space, organized as guest virtual page numbers (GVPNs). A request to a GVPN within the guest virtual address space may be mapped to a corresponding guest physical address and guest physical page number (GPPN) by the emulated MMU function within the virtual processor 122. Guest physical memory is organized as distinct units, called pages, each with a corresponding, unique GPPN. The application 128 and the guest system software 127 may generate I/O requests, such as I/O block requests to the mass storage system 114. VM 120-1 may implement virtual direct memory access (DMA), enabling its virtual storage devices to directly access GPPNs in response to I/O requests from the guest OS 124.

Each one of the VMs 120 may have a substantially identical internal structure to VM 120-1. Each one of the VMs 120 may also have an independent physical address space, and therefore a given GPPN within one VM 120 is independent of the same GPPN within a different one of the VMs 120. Each GPPN references a page of guest physical memory, which is mapped to a page of machine memory that is referenced via a machine page number (MPN). Each GPPN may alternatively be backed by a machine memory page in a remote host, or backed by a location in a swap file 150 residing within the mass storage system 114. In some cases, a GPPN may not be mapped to or backed by any page of machine memory or swap file location. More than one GPPN may map to a common, shared MPN.

Each mapping to a shared MPN is marked with a “copy on write” (COW) attribute. “Copy on write” describes a technique well known in the art in which mappings to a particular page are marked read-only so that when the page is written to, an exception is triggered, and an associated exception handler is configured to cause the page to be copied to a new location in memory, which is then written to according to the original write request. In the context of page sharing, a plurality of GPPNs may be mapped to a single shared memory page, for example. The page sharing is transparent to each guest, so that from the guest's perspective it has its own copy of the data. When a guest attempts to write to the shared page, however, the page is copied, and the write is directed to the newly created copy of the page, which is then “owned” by (i.e., mapped to) only that guest. Contents of U.S. Pat. Nos. 6,789,156 and 7,620,766 relating to read-only mappings, COW, content-based deduplication, and accelerated comparisons by hashing, are hereby incorporated by reference.
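By way of illustration only, the copy-on-write behavior described above may be sketched in Python as follows. This is a simplified model and not the disclosed implementation: the class names `MachineMemory` and `Guest`, the per-mapping `cow` flag, and the whole-page write are assumptions introduced for the sketch, and the write-protection exception is modeled as an ordinary conditional.

```python
class MachineMemory:
    """Toy machine memory (hypothetical): MPN -> page contents and mapping count."""

    def __init__(self):
        self.pages = {}      # MPN -> bytes held in that machine page
        self.refcount = {}   # MPN -> number of GPPNs currently mapped to it
        self._next_mpn = 0

    def allocate(self, data):
        """Allocate a new machine page holding a copy of `data`."""
        mpn = self._next_mpn
        self._next_mpn += 1
        self.pages[mpn] = bytes(data)
        self.refcount[mpn] = 0
        return mpn


class Guest:
    """Toy guest (hypothetical) with its own GPPN-to-MPN map."""

    def __init__(self, memory):
        self.memory = memory
        self.gppn_to_mpn = {}   # GPPN -> (MPN, cow flag)

    def map(self, gppn, mpn, cow):
        self.gppn_to_mpn[gppn] = (mpn, cow)
        self.memory.refcount[mpn] += 1

    def read(self, gppn):
        mpn, _ = self.gppn_to_mpn[gppn]
        return self.memory.pages[mpn]

    def write(self, gppn, data):
        mpn, cow = self.gppn_to_mpn[gppn]
        if cow:
            # Stand-in for the exception handler: copy the shared page to a
            # new machine page and remap this guest to the private copy.
            self.memory.refcount[mpn] -= 1
            mpn = self.memory.allocate(self.memory.pages[mpn])
            self.memory.refcount[mpn] = 1
            self.gppn_to_mpn[gppn] = (mpn, False)
        # The write is then applied to the (now private) page.
        self.memory.pages[mpn] = bytes(data)


mem = MachineMemory()
shared = mem.allocate(b"os-data")
g1, g2 = Guest(mem), Guest(mem)
g1.map(7, shared, cow=True)
g2.map(3, shared, cow=True)
g2.write(3, b"private")   # triggers the copy; g1 still sees the shared page
```

In this sketch the first guest continues to read the original contents through the shared page, while the second guest's mapping now points at a private page that only it references.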

In general, a VMM provides an interface between a VM and a host runtime environment. The host runtime environment may be a conventional operating system or a kernel configured to manage hardware elements and overall operation of the computer system 100 and thereby provide system services to the VMM. Alternatively, the host runtime environment may be any technically feasible software module configured to manage the computer system 100 and thereby provide system services to the VMM. The VMM provides access to hardware devices and system services to enable the VM to emulate an apparent hardware system via the virtual processor 122.

In one embodiment, the VMM 130-1 is configured to provide a software interface between VM 120-1 and the kernel 134. In other embodiments, the VMM 130-1 may be configured to provide an interface between VM 120-1 and a host operating system (not shown). The VMM 130-1 includes a GPPN-to-MPN map 132-1, used to translate guest physical addresses generated by VM 120-1 into corresponding machine addresses that may be used to access data stored in machine memory 110. Each VMM 130-1 through 130-n includes a respective GPPN-to-MPN map 132. In one embodiment, GPPN-to-MPN maps 132-1 through 132-n are managed by the kernel 134.

The kernel 134 is configured to manage certain hardware and software resources within the processor complex 112. In particular, the kernel 134 schedules and manages processes VM 120-1 through 120-n, and VMM 130-1 through VMM 130-n, executing on the one or more CPUs 142. The kernel 134 includes at least one memory management table 135, configured to maintain each GPPN-to-MPN mapping for accessing machine memory 110. The memory management table 135 includes mappings for each GPPN-to-MPN map 132-1 through 132-n. In this way, the kernel has a global view of all guest physical address to machine address mappings.

The total storage configured for all guest physical address spaces for VMM 130-1 through VMM 130-n may exceed the total available storage within machine memory 110. The kernel 134 implements a memory paging system that swaps selected pages of memory between machine memory 110 and the swap file 150 within the mass storage system 114. Any technically feasible technique may be used to page data between machine memory 110 and the swap file 150, residing within a persistent storage system, such as the mass storage system 114. In addition, data compression techniques may be used to present a virtual memory space that is larger than the machine memory 110. Techniques such as ballooning may be used to trigger guest-level memory reclamation such as page swapping. Furthermore, any technically feasible technique may be implemented to select a page to be swapped from machine memory 110 to a swap file and vice versa. When a page of memory is swapped from machine memory 110 to the swap file 150, the memory management table 135 is updated to reflect a change in disposition of the contents of a corresponding GPPN as being in the swap file 150 rather than resident in machine memory 110. Similarly, when a page within the swap file 150 is swapped into machine memory 110, the memory management table 135 may be updated to reflect another change in disposition for the contents of a corresponding GPPN as being resident in machine memory 110 rather than in the swap file 150. In certain embodiments, the VMM 130 intercepts MMU instructions from the guest OS 124 to enable the kernel 134 to configure MMUs 141 to perform direct GVPN to MPN mappings for high performance memory access from each corresponding VM 120. For example, certain MMU implementations allow nested mappings, enabling the MMU to perform direct GVPN to MPN mappings for high performance access from each corresponding VM 120.

The mass storage system 114 includes a base volume 160, configured to include storage blocks 162-1 through 162-k. The base volume 160 is a reference disk image that includes a generic image of the guest system software 127, the application 128, and any other shared software modules. The base volume may also include commonly used data files. A virtual disk volume is generated and presented to a VM 120, based on the base volume 160 and on any modifications to specific blocks 162 residing within the base volume 160. Modifications to specific blocks 162 are stored in a specific difference file 164, which is associated with a corresponding VM 120. Initially, each virtual disk volume presented to each VM is substantially identical to the base volume 160. However, during the normal course of execution, a given VM 120 may write one or more blocks within a corresponding virtual disk volume, triggering a copy on write (COW) for each one of the one or more blocks being written. The process of writing a block, via COW, to the virtual disk volume “specializes” the block for the virtual disk volume. Specialized block data is stored within a corresponding difference file 164, leaving the base volume 160 unchanged. For example, VM 120-1 may write block 162-k of the virtual disk volume, thereby specializing block 162-k. Thereafter, data from the specialized block 162-k will be returned when read by VM 120-1. VM 120-n is presented with an independent and private virtual disk volume. VM 120-n may independently specialize block 162-k within its corresponding virtual disk volume, whereby corresponding write data is stored in difference file 164-n.
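For illustration, the relationship between the base volume, a per-VM difference file, and the resulting virtual disk volume may be modeled as in the following Python sketch. The class `VirtualDisk`, its method names, and the use of an in-memory dictionary in place of a difference file are assumptions made for the sketch only; the on-disk format of a difference file is not specified here.

```python
class VirtualDisk:
    """Toy virtual disk volume (hypothetical): a shared base plus private differences."""

    def __init__(self, base_volume):
        self.base = base_volume   # shared, read-only list of blocks
        self.diff = {}            # block number -> specialized block data

    def is_specialized(self, block_number):
        return block_number in self.diff

    def read_block(self, block_number):
        # A specialized block is served from the difference store;
        # every other block falls through to the shared base volume.
        if block_number in self.diff:
            return self.diff[block_number]
        return self.base[block_number]

    def write_block(self, block_number, data):
        # Writing specializes the block for this volume only;
        # the base volume is never modified.
        self.diff[block_number] = data


base_volume = [b"blk-0", b"blk-1", b"blk-2"]
disk_1 = VirtualDisk(base_volume)   # volume presented to one VM
disk_n = VirtualDisk(base_volume)   # volume presented to another VM

disk_1.write_block(2, b"vm1-data")  # specializes block 2 for disk_1 only
```

After the write, `disk_1` returns the specialized data for block 2, while `disk_n` and the base volume still hold the original block.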

By generating a virtual disk volume for each corresponding VM 120 from base volume 160 and a difference file 164, overall required storage within the mass storage system 114 may be substantially reduced. Simultaneously, each VM 120 is presented an apparently independent, private, and writable instance of the base volume 160. This technique of generating and presenting virtual disk volumes to the VMs 120 is advantageously efficient and transparent to the VMs 120 and related guest system software 127.

In one embodiment, the guest system software 127 and related applications, such as application 128, comprise a desktop environment that a user may invoke and access, for example via network port 116. The computer system 100 is configured to start VMs 120-1 through VM 120-n, which causes corresponding desktop environments to boot and become ready for access by users. Each VM 120 typically reads a substantially identical set of data blocks when booting the guest system software 127. In conventional systems, the identical blocks are redundantly requested, contributing to a “boot storm.” In one or more embodiments of the present invention, however, redundant read requests are shortcut by a disk page tracker 170, described in greater detail below. Shortcut I/O requests are completed within the hypervisor 136 and, therefore, do not contribute additional I/O traffic.

FIG. 2 illustrates a base volume 160 and linked clones 210 of the base volume, according to one embodiment of the invention. The base volume 160 and difference files 164 reside on the mass storage system 114 of FIG. 1. Each difference file 164 stores specialized blocks for a corresponding VM 120. In one embodiment each difference file 164 comprises a random access file within a host file system; however, persons skilled in the art will recognize that other technically feasible means for representing specialized blocks associated with a particular VM 120 may also be implemented without departing from the scope of the invention.

The base volume 160, or certain blocks thereof, may be represented in machine memory 110 as a read-only common OS base disk 214, which linked clones 210 are configured to share as a root volume. In one embodiment, each linked clone 210 comprises a virtual disk volume with a block configuration that corresponds to the block configuration of the common OS base disk 214. Each unspecialized block within a given linked clone 210 corresponds to an equivalent block within the common OS base disk 214. Each specialized block within the linked clone 210 corresponds to a specialized block from an associated set of specialized blocks 212-1, 212-n.

For example, block “Blk 1” of linked clone 210-1 is unspecialized (has not been written by VM 120-1) and therefore corresponds to “Blk 1” of the common OS base disk 214. Similarly, block “Blk n-1” of linked clone 210-1 is unspecialized and therefore corresponds to “Blk n-1” of the common OS base disk 214. However, blocks “Blk 3” and “Blk n” of linked clone 210-1 are specialized and therefore correspond to blocks “Blk 3” and “Blk n” of the set of specialized blocks 212-1. In one embodiment, blocks “Blk 3” and “Blk n” of the set of specialized blocks 212-1 are stored in difference file 164-1 within the mass storage system 114. In this example, “Blk 1” is not specialized for either linked clone 210-1 or 210-n. Because block “Blk 1” remains unspecialized with respect to each VM 120, only one instance of “Blk 1” is actually needed in machine memory 110. As described below, any blocks within the common OS base disk 214 that are unspecialized need only one instance in machine memory 110 when being read by an arbitrary number of VMs.

When a disk block is read from a target disk volume by a VM 120, sufficient storage within machine memory 110 is allocated to the VM 120 to receive the requested disk block prior to a read being posted to the target disk. The storage is allocated in the form of one or more machine pages, referenced by one or more associated MPNs. The disk page tracker 170 of FIG. 1 maintains a page tracking table 175 comprising entries that each reference a disk block identified by volume identifier and block offset. Each entry also includes an MPN and metadata associated with the disk block. An entry is added to the page tracking table 175 when a newly requested disk block represents a good candidate for sharing. For example, when a disk block resides on the base volume 160, that disk block is a good candidate for sharing. Blocks accessed once on the base volume 160 by one VM 120 tend to be accessed again by another VM 120. The page tracking table 175 is described in greater detail below in FIG. 4. The disk page tracker 170 is configured to track which disk blocks are being accessed by which VM 120, and to trigger MPN page sharing as described below. When disk block data residing in a memory page referenced by a shared MPN is written by an associated VM 120, the disk block is specialized for the VM 120 and the shared MPN is unshared to create a private MPN for the VM 120. The disk block may be specialized using any technically feasible technique and the shared MPN may be unshared using any technically feasible technique, such as any technically feasible variation of COW. The disk page tracker 170 comprises a set of procedures configured to execute in response to a disk block access request. Certain of the procedures are configured to maintain the page tracking table 175. Certain other procedures are configured to either generate an I/O request from an input request or to shortcut the I/O request and to generate an equivalent result without generating an I/O request. In one embodiment, certain procedures of the set of procedures may be implemented as call-back functions registered to I/O request state for in-flight I/O requests.

In the embodiments illustrated herein, the size of a disk block and the size of a page in memory are the same. However, the invention is applicable to other embodiments where the size of the disk block and the size of a page in memory are not the same.

FIG. 3A illustrates a memory mapping from a guest virtual page number (GVPN) 310 within a VM 120-1 to an MPN 314 via a GPPN 312, according to one embodiment of the invention. Each page number corresponds to a page of memory within a contiguous range of pages of memory. The GVPN 310 corresponds to a page of memory within a range defined by a guest virtual address space. The GVPN 310 is mapped to GPPN 312 by the guest OS 124 of FIG. 1. The GPPN 312 corresponds to a page of memory within a range defined by a guest physical address space, which the guest OS 124 treats as a private physical address space. However, the GPPN 312 is further mapped into MPN 314 by the hypervisor 136. The MPN 314 corresponds to a page of machine memory 110, which is actual memory circuitry, such as a dynamic random access memory (DRAM), configured to store pages of data comprising the machine memory 110.
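The two-stage mapping of FIG. 3A may be sketched, purely for illustration, as two table lookups. The 4 KiB page size (PAGE_SHIFT), the dictionary names, and the use of the reference numerals 310, 312, and 314 as page numbers are assumptions made for this example only.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages; the description does not fix a page size

# Hypothetical tables; reference numerals stand in for actual page numbers.
guest_page_table = {0x310: 0x312}  # GVPN -> GPPN, maintained by the guest OS
hypervisor_map = {0x312: 0x314}    # GPPN -> MPN, maintained by the hypervisor


def translate(guest_virtual_address):
    """Translate a guest virtual address to a machine address."""
    gvpn = guest_virtual_address >> PAGE_SHIFT
    offset = guest_virtual_address & ((1 << PAGE_SHIFT) - 1)
    gppn = guest_page_table[gvpn]  # first stage: guest OS mapping
    mpn = hypervisor_map[gppn]     # second stage: hypervisor mapping
    return (mpn << PAGE_SHIFT) | offset
```

Because the guest OS sees only the first table, the hypervisor is free to change the second table (for example, to point a GPPN at a shared MPN) without the guest being aware of it.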

After a block of data that is a candidate for sharing is read from a disk volume, the block of data resides in a page of memory referenced by MPN 314, which is mapped as a read-only page by each associated GPPN. A block of data that is not a candidate for sharing may be stored in a page of memory referenced by an MPN configured for read/write access.

FIG. 3B illustrates a second memory mapping from a second GVPN 320 within a second VM 120-n to a second MPN 334 via a second GPPN 322, according to one embodiment of the invention. A second page corresponding to MPN 334 is allocated and mapped to GVPN 320 via GPPN 322 prior to a read being posted to retrieve a corresponding disk block. At this point the guest OS 124 is managing GPPN 322 as a disk cache for related file data. In this scenario, MPN 314 and MPN 334 both correspond to the same volume and disk block within the volume, and the disk page tracker 170 of FIG. 1 identifies a pending disk block requested for MPN 334 as the same disk block previously retrieved into a page of memory referenced by MPN 314. Therefore, rather than posting a read request to retrieve the corresponding disk block from the mass storage system 114, GPPN 322 is instead remapped to share MPN 314. MPN 334 is then released, as shown below in FIG. 3C.

FIG. 3C illustrates a “flip” operation whereby the second GPPN 322 is remapped to the first MPN 314 for sharing and the second MPN 334 is released, according to one embodiment of the invention. By remapping GPPN 322 to MPN 314, the end effect of posting a disk read to retrieve data associated with GPPN 322 is accomplished without actually posting the disk read request or generating related I/O traffic. In this way, the hypervisor 136 of FIG. 1 may shortcut I/O requests by mapping their corresponding destination GPPN to a known page resident in machine memory 110. When a VDI system is executing a plurality of desktop environment instances, a significant portion of the disk blocks retrieved from the base volume 160 to boot each instance are likely to be stored in pages referenced by shared MPNs, which reduces overall memory pressure on the machine memory 110.

FIG. 3D illustrates a copy-on-write remapping of the shared MPN 314 to a private MPN 340 for the second VM 120-n, according to one embodiment of the invention. If VM 120-n attempts to write GPPN 322, previously mapped to MPN 314 in FIG. 3C, a COW operation is performed and GPPN 322 is remapped to private MPN 340. The private MPN 340 is needed to receive the write data. A corresponding disk block is also specialized using any technically feasible technique to preserve the data resident in a page of memory referenced by MPN 340 for future access from the mass storage system 114.

FIG. 3E illustrates the shared MPN 314 being orphaned as a result of a copy-on-write remapping triggered by the first VM 120-1, according to one embodiment of the invention. If VM 120-1 attempts to write GPPN 312, previously mapped to MPN 314 in FIG. 3D, a COW operation is performed and GPPN 312 is remapped to private MPN 342. The private MPN 342 is needed to receive the write data.
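The flip of FIG. 3C and the copy-on-write remappings of FIGS. 3D and 3E may be sketched as follows. This is a simplified illustration under stated assumptions, not the hypervisor's implementation: the module-level dictionaries, the allocate, flip, and write helpers, and the MPN numbering starting at 1000 are all hypothetical, and specialization of the disk block on write is omitted.

```python
import itertools

_mpn_counter = itertools.count(1000)  # hypothetical MPN allocator
pages = {}         # mpn -> page contents
mapping = {}       # (vm, gppn) -> mpn
read_only = set()  # MPNs currently mapped read-only because they are shared


def allocate(data=b""):
    mpn = next(_mpn_counter)
    pages[mpn] = bytearray(data)
    return mpn


def flip(vm, gppn, shared_mpn):
    """Remap a GPPN to an already resident shared MPN and release the old MPN."""
    released = mapping[(vm, gppn)]
    mapping[(vm, gppn)] = shared_mpn
    read_only.add(shared_mpn)
    del pages[released]
    return released


def write(vm, gppn, data):
    """Write to a guest page, performing copy-on-write if the MPN is shared."""
    mpn = mapping[(vm, gppn)]
    if mpn in read_only:
        mpn = allocate(pages[mpn])  # private copy receives the write data
        mapping[(vm, gppn)] = mpn
    pages[mpn][:len(data)] = data
    return mpn


# FIG. 3A: VM 1 has already read the block into the page playing the role of MPN 314.
mpn_314 = allocate(b"base block")
mapping[("vm1", 312)] = mpn_314
read_only.add(mpn_314)
# FIG. 3B: VM n allocates a page for the same block before any read is posted.
mapping[("vmn", 322)] = allocate()
# FIG. 3C: the read is shortcut by flipping GPPN 322 onto the shared page.
flip("vmn", 322, mpn_314)
# FIG. 3D: a write by VM n triggers copy-on-write to a private page.
private_mpn = write("vmn", 322, b"BASE")
```

After the write, the shared page still holds the original block contents, so VM 1 continues to read unmodified data.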

The previously shared MPN 314 is orphaned in an orphan cache 139, configured to reference and therefore retain MPNs corresponding to previously shared disk blocks deemed likely candidates for future sharing. The orphan cache 139 retains disk blocks that may be accessed at some point in the future by a VM 120, although no VM 120 is currently accessing the blocks. As the number of MPNs referenced by the orphan cache 139 grows, a replacement policy may evict certain cache entries to make space for new cache entries. Any technically feasible replacement policy, such as a least recently used (LRU) policy, may be used to initiate eviction of cache entries from the orphan cache 139. In one embodiment, a disk block is deemed a likely candidate for sharing if the disk block resides on a shared base volume. In another embodiment, a disk block is deemed a likely candidate for sharing if the disk block is characterized as meeting or exceeding a certain threshold value for a disk block reference count, supplied by a file system that is configured to track disk block reference counts. The threshold may be as low as two, or arbitrarily high. The file system may track disk block reference counts using any technically feasible technique. Persons skilled in the art will recognize that other techniques for characterizing a disk block as being a likely candidate for sharing may be utilized without departing from the scope of the present invention.
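One possible replacement policy for the orphan cache, the LRU policy named above, may be sketched as follows. The class OrphanCache, its capacity parameter, and its method names are hypothetical; the sketch assumes cache entries are keyed by a (volume identifier, block offset) pair.

```python
from collections import OrderedDict


class OrphanCache:
    """Hypothetical LRU cache retaining MPNs for previously shared disk blocks."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # (volume, offset) -> mpn, oldest first

    def put(self, target, mpn):
        """Retain an MPN; return the (target, mpn) pairs evicted to make space."""
        self.entries[target] = mpn
        self.entries.move_to_end(target)
        evicted = []
        while len(self.entries) > self.capacity:
            evicted.append(self.entries.popitem(last=False))
        return evicted

    def get(self, target):
        """Look up a retained MPN and mark it as most recently used."""
        mpn = self.entries.get(target)
        if mpn is not None:
            self.entries.move_to_end(target)
        return mpn
```

Evicted pairs are returned to the caller so that the corresponding pages of memory can be reclaimed.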

FIG. 4 illustrates a page tracking table 175 configured to include at least a reference count 414 and a pending status bit 416 for each tracked page, according to one embodiment of the invention. The page tracking table 175 associates a specified disk block, identified by a target reference 410 comprising a volume identifier (baseDisk) and block offset (Offset), with an MPN 412. In another embodiment, physical disk locations of disk blocks may be tracked directly without using references to volume identifiers or block offsets. In either case, the target reference 410 represents a unique disk block within the mass storage system 114 of FIG. 1 or any other storage system coupled to the computer system 100. The MPN 412 represents data from the unique disk block as being resident in machine memory. In one embodiment, each disk block represented within the page tracking table 175 has been characterized as likely to be shared. Other blocks are not represented within the page tracking table 175.

The MPN 412 is characterized by the reference count 414, which indicates how many guest physical pages are mapped into the MPN 412. The MPN 412 is also characterized by the pending status bit 416, which indicates that a corresponding disk block request is currently in flight but has not yet completed. The pending status bit 416 allows the hypervisor 136 to avoid launching an additional request to a disk block that has already been requested but is not yet resident within machine memory. In one embodiment, a first request to a particular disk block results in a reference count 414 being set to two, indicating that one guest physical page is mapped to the corresponding MPN 412, and the orphan cache 139 is otherwise retaining the MPN 412. When a second guest physical page is mapped to the same MPN 412, the reference count is set to three. When no guest physical page is currently mapped to the MPN 412, the reference count is set to one, indicating the orphan cache 139 is retaining the MPN 412. In each case, the MPN 412 is mapped read-only. When the orphan cache 139 needs to evict a specific MPN, the orphan cache 139 removes its reference to the MPN, reducing the reference count for the MPN to zero, thereby allowing the page of memory referenced by the MPN to be reclaimed. Persons skilled in the art will recognize that the above assigned meanings of reference count values are exemplary and other systems of reference count values may be implemented without departing from the scope of the present invention.
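The exemplary reference count values described above may be sketched as follows. The names TrackedPage and PageTrackingTable and their methods are hypothetical, and the rule that only an orphaned entry (reference count of one) may be evicted is an assumption made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class TrackedPage:
    mpn: int
    ref_count: int  # guest mappings plus one reference held by the orphan cache
    pending: bool   # True while the disk read is still in flight


class PageTrackingTable:
    def __init__(self):
        self.entries = {}  # (volume_id, block_offset) -> TrackedPage

    def add(self, volume_id, offset, mpn):
        # First request: one guest mapping plus the orphan cache reference.
        self.entries[(volume_id, offset)] = TrackedPage(mpn, 2, True)

    def complete(self, volume_id, offset):
        self.entries[(volume_id, offset)].pending = False

    def map_guest_page(self, volume_id, offset):
        entry = self.entries[(volume_id, offset)]
        entry.ref_count += 1
        return entry.mpn

    def unmap_guest_page(self, volume_id, offset):
        self.entries[(volume_id, offset)].ref_count -= 1

    def evict(self, volume_id, offset):
        """Drop the orphan cache reference; return the MPN to reclaim, if any."""
        entry = self.entries[(volume_id, offset)]
        if entry.ref_count != 1:  # assumption: evict only orphaned pages
            return None
        del self.entries[(volume_id, offset)]
        return entry.mpn
```

Under this scheme a count of two, three, or one corresponds respectively to one guest mapping, two guest mappings, and retention by the orphan cache alone.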

FIG. 5 is a flow diagram of method steps 500, performed by a hypervisor, for tracking and shortcutting input/output requests, according to one embodiment of the invention. Although the method steps are described in conjunction with the system of FIGS. 1 through 4, persons skilled in the art will understand that the method steps carried out in any system are within the scope of the invention.

The method begins in step 510, where the hypervisor 136 receives an I/O read request for a disk block, comprising a volume identifier, such as a disk identifier, and a block offset into the volume. The I/O read request typically originates from a VM 120.

If, in step 512, the requested disk block has a sufficient likelihood of being shared, then the method proceeds to step 514. In one embodiment, if the requested disk block resides on a shared base disk, such as base volume 160, then the requested disk block is deemed to have sufficient likelihood of being shared. In an alternative embodiment, if the requested disk block is characterized as exceeding a certain threshold value for a disk block reference count, supplied by a file system that is configured to track disk block reference counts, then the requested disk block is deemed to have sufficient likelihood of being shared.

If, in step 514, the requested disk block is tracked by the page tracking table 175, then the method proceeds to step 520. The requested disk block is determined to be tracked if the requested disk block is represented by an entry in the page tracking table 175. The requested disk block is tracked but pending if the pending status bit 416 is set.

In step 520, the hypervisor 136 flips the page mapping of the GPPN allocated for the requested disk block to map to the corresponding tracked MPN indicated in the page tracking table 175. In step 522, the page tracking metadata is updated to reflect an incremented reference count 414. In step 524, page sharing metadata is updated, for example, to inform the sharing module 138 that an additional GPPN is mapped to the shared MPN. Any technically feasible page sharing mechanism may be implemented by the sharing module 138 without departing from the scope of the present invention.

In step 550, the method returns requested page data in the form of a mapped MPN. In one embodiment, if the pending status bit 416 for the corresponding page is still set, then the method waits until the pending status bit is cleared before returning. The method terminates in step 590.

Returning to step 514, if the requested disk block is not tracked by the page tracking table 175, then the method proceeds to step 530, where the hypervisor 136 initiates execution of a read input/output operation to retrieve the requested disk block from mass storage system 114. In step 532, data for the requested disk block is stored in the corresponding mapped MPN(s). In step 534, an entry is added to the page tracking table 175, with page tracking metadata comprising target reference 410, MPN 412, reference count 414, and a pending status bit 416. In one embodiment, the initial value of reference count 414 is set to two. The pending status bit 416 is set to one until requested block data is written to completion in the page of memory referenced by MPN 412, after which the pending status bit 416 is cleared. In step 536, the MPN 412 is marked for sharing to the sharing module 138 and any implementation-specific metadata is added for MPN 412. In one embodiment, the GPPN mapped to MPN 412 and the associated VM identifier are added to a data structure (e.g., queue or list) containing page sharing hints. This data structure is maintained by the sharing module 138, and used by the sharing module 138 to select candidate pages for sharing.

Returning to step 512, if the requested disk block does not have a sufficient likelihood of being shared, then the method proceeds to step 540, where the hypervisor 136 initiates execution of a read input/output operation to retrieve the requested disk block from mass storage system 114. In step 542, data for the requested disk block is stored in the corresponding mapped MPN(s).
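The three branches of method 500 described above may be sketched as follows. This is a simplified, synchronous illustration rather than the hypervisor's implementation: the class Tracker, the in-memory dictionary standing in for the mass storage system, and the shared-volume test used for step 512 are assumptions, and steps 524 and 536 (page sharing metadata) are omitted. The table entry is created before the read completes so that the pending status bit is meaningful.

```python
class Tracker:
    def __init__(self, storage, shared_volumes):
        self.storage = storage                # (volume, offset) -> bytes; stand-in for disk
        self.shared_volumes = shared_volumes  # volumes whose blocks are sharing candidates
        self.table = {}    # (volume, offset) -> {"mpn", "ref_count", "pending"}
        self.pages = {}    # mpn -> page contents
        self.mapping = {}  # (vm, gppn) -> mpn
        self.io_reads = 0  # number of reads actually issued to storage
        self._next_mpn = 0

    def _disk_read(self, target):
        self.io_reads += 1
        return self.storage[target]

    def read(self, vm, gppn, volume, offset):
        target = (volume, offset)
        # An MPN is allocated and mapped before any read is posted.
        mpn = self._next_mpn
        self._next_mpn += 1
        self.pages[mpn] = None
        self.mapping[(vm, gppn)] = mpn

        if volume not in self.shared_volumes:  # step 512 -> steps 540, 542
            self.pages[mpn] = self._disk_read(target)
            return mpn

        entry = self.table.get(target)         # step 514
        if entry is not None:
            del self.pages[mpn]                # step 520: flip, release unused MPN
            self.mapping[(vm, gppn)] = entry["mpn"]
            entry["ref_count"] += 1            # step 522
            return entry["mpn"]                # step 550

        # Steps 530-534: issue the read and start tracking the block.
        entry = {"mpn": mpn, "ref_count": 2, "pending": True}
        self.table[target] = entry
        self.pages[mpn] = self._disk_read(target)
        entry["pending"] = False
        return mpn
```

In this sketch a second request for a tracked block returns the same MPN as the first and leaves the count of issued reads unchanged, which is the shortcut described above.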

An alternative embodiment is described below in FIGS. 6A-6B and 7. In the alternative embodiment, each requested disk block that is a candidate for sharing is read and stored as a read-write block mapped into a GPPN within a VM originating the request. Prior to returning, however, a copy is made of a corresponding MPN and placed into the orphan cache 139. In this way, if the VM originating the request attempts to write the MPN soon after the MPN was read, a relatively expensive write fault is avoided. The GPPN and its associated VM identifier may be added to a page sharing hint data structure, which is later used by the sharing module 138 to find page sharing opportunities, thereby reducing overall memory pressure. If a subsequent I/O request is a match for the MPN copy within the orphan cache 139, then the I/O request may be shortcut and served by the orphan cache 139. In another alternative embodiment, one or more MPNs are added to a page sharing hint data structure, and a back map from the MPNs to the GPPNs is used by the sharing module 138 to enable sharing of one or more MPNs by two or more GPPNs.

FIG. 6A illustrates a memory mapping from a GVPN 610 within a virtual machine 120-1 to an MPN 614 associated with an input/output read request, according to one embodiment of the invention. The GVPN 610 is mapped via a GPPN 612 to the MPN 614. After an associated disk block of data that is a candidate for sharing is read from a disk volume, the block of data resides in a page of memory referenced by MPN 614, and the mapping from GPPN 612 to MPN 614 is marked with a read-write attribute. Prior to returning from the read request, the hypervisor 136 copies the data associated with MPN 614 to create a copy of the page of memory associated with MPN 614 in the orphan cache 139.

FIG. 6B illustrates a copy operation of a requested block of data associated with one MPN 614 to another MPN 620, according to one embodiment of the invention. The contents of MPN 614, which are otherwise privately mapped to GPPN 612, are copied to MPN 620, which resides within the orphan cache 139. A reference to MPN 620, along with at least a target reference 410, reference count (initially equal to “1”) and pending bit, is added to the page tracking table 175. In one embodiment, GPPN 612 and its associated VM identifier are added to the page sharing hint data structure. They are removed from the page sharing hint data structure when GPPN 612 is modified. The page sharing mechanism implemented by the sharing module 138 first reads the page sharing hint data structure and looks for candidate pages for sharing prior to any other sharing activity. The page sharing mechanism may be implemented using any technically feasible technique without departing from the scope of the present invention. For example, the page sharing mechanism may be a daemon or background service that identifies candidate pages for sharing, hashes the contents of the page, and compares the hash value to an existing hash table. As each page is hashed, it may be added to the hash table. Pages may be identified for sharing using any technique, including randomly. When a matching hash is identified, the corresponding page is shared with the candidate page by remapping the related GPPN to a common MPN and releasing the redundant MPN. The page sharing mechanism may receive hints from other processes. If the contents of the hinted pages are identical, then they may be immediately shared rather than waiting to be randomly discovered. Details of this mechanism are described in U.S. Pat. Nos. 6,789,156 and 7,620,766, previously incorporated by reference.
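A hint-driven, hash-based sharing pass of the general kind described above may be sketched as follows. This is not the mechanism of the incorporated patents; the function share_pages, its arguments, and the choice of SHA-256 as the hash are assumptions made solely for illustration.

```python
import hashlib


def share_pages(pages, mapping, hints):
    """Share identical hinted pages.

    pages:   mpn -> bytes (page contents)
    mapping: (vm, gppn) -> mpn
    hints:   iterable of (vm, gppn) keys suggested as sharing candidates
    Returns the list of MPNs released as redundant.
    """
    by_hash = {}   # content hash -> MPN chosen as the common page
    released = []
    for key in hints:
        mpn = mapping[key]
        digest = hashlib.sha256(pages[mpn]).digest()
        common = by_hash.setdefault(digest, mpn)
        # Confirm with a full comparison so a hash collision cannot
        # cause two different pages to be shared.
        if common != mpn and pages[common] == pages[mpn]:
            mapping[key] = common
            if mpn not in mapping.values():
                released.append(mpn)
                del pages[mpn]
    return released
```

A page is released only after no guest physical page remains mapped to it.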

Because MPN 620 was deemed likely to be shared, it is at least temporarily held in the orphan cache 139, allowing another VM 120 an opportunity to request MPN 620 and other MPNs within the orphan cache 139. Each time a VM 120 is able to retrieve a disk block of data from machine memory 110 (cache), an input/output operation is shortcut and I/O resources are preserved.

FIG. 7 is a flow diagram of method steps 700, performed by a hypervisor, for tracking and shortcutting input/output requests, according to the embodiment of the invention described in conjunction with FIGS. 6A-6B. Although the method steps are described in conjunction with the system of FIGS. 1 through 4, persons skilled in the art will understand that the method steps carried out in any system are within the scope of the invention.

The method begins in step 710, where the hypervisor 136 receives an I/O read request for a disk block, comprising a volume identifier, such as a disk identifier, and a block offset into the volume. The I/O read request typically originates from a VM 120.

If, in step 712, the requested disk block has a sufficient likelihood of being shared, then the method proceeds to step 714. In one embodiment, if the requested disk block resides on a shared base disk, such as base volume 160, then the requested disk block is deemed to have sufficient likelihood of being shared. In an alternative embodiment, if the requested disk block is characterized as exceeding a certain threshold value for a disk block reference count, supplied by a file system that is configured to track disk block reference counts, then the requested disk block is deemed to have sufficient likelihood of being shared.

If, in step 714, the requested disk block is tracked by the page tracking table 175, then the method proceeds to step 720. The requested disk block is determined to be tracked if the requested disk block is represented by an entry in the page tracking table 175. The requested disk block is tracked but pending if the pending status bit 416 is set.

In step 720, the hypervisor 136 flips the page mapping of the GPPN associated with the requested disk block to map to the corresponding tracked MPN indicated in the page tracking table 175. In step 722, the page tracking metadata is updated to reflect an incremented reference count 414. In step 724, page sharing metadata is updated, for example, to inform the sharing module 138 that an additional GPPN is mapped to the shared MPN.

In step 750, the method returns requested page data in the form of a mapped MPN. In one embodiment, if the pending status bit 416 for the corresponding page is still set, then the method waits until the pending status bit is cleared before returning. The method terminates in step 790.

Returning to step 714, if the requested disk block is not tracked by the page tracking table 175, then the method proceeds to step 730, where the hypervisor 136 initiates execution of a read input/output operation to retrieve the requested disk block from mass storage system 114. In step 732, data for the requested disk block is stored in the corresponding read-write mapped MPN(s). In step 733, data associated with the read-write mapped MPN is copied to a page of memory referenced by a second MPN, which is then mapped for read-only access. In step 734, an entry is added to the page tracking table 175, with page tracking metadata comprising target reference 410, MPN 412 (references MPN mapped for read-only access), reference count 414, and a pending status bit 416. In one embodiment, the initial value of reference count 414 is set to two. The pending status bit 416 is set to one until requested block data is written to completion in the page of memory referenced by MPN 412, after which the pending status bit 416 is cleared. In step 736, the GPPN associated with the requested disk block and the associated VM identifier are added to the page sharing hint data structure.

Returning to step 712, if the requested disk block does not have a sufficient likelihood of being shared, then the method proceeds to step 740, where the hypervisor 136 initiates execution of a read input/output operation to retrieve the requested disk block from mass storage system 114. In step 742, data for the requested disk block is stored in one or more pages of memory referenced by the corresponding mapped MPN(s).
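The tracked and untracked branches of method 700 may be sketched as follows. The sketch is illustrative only and makes several assumptions: the class AltTracker and its fields are hypothetical, the likelihood test of step 712 and the pending status bit are omitted, the read is modeled as synchronous, and the initial reference count follows the value of one given for FIG. 6B (the description of step 734 mentions two in one embodiment).

```python
import itertools


class AltTracker:
    def __init__(self, disk):
        self.disk = disk        # target -> bytes; stand-in for mass storage
        self.table = {}         # target -> {"mpn", "ref_count"}
        self.pages = {}         # mpn -> page contents
        self.writable = set()   # MPNs mapped read-write
        self.mapping = {}       # (vm, gppn) -> mpn
        self.hints = []         # page sharing hints: (vm, gppn)
        self.io_reads = 0
        self._mpns = itertools.count(1)

    def read(self, vm, gppn, target):
        entry = self.table.get(target)               # step 714
        if entry is not None:
            self.mapping[(vm, gppn)] = entry["mpn"]  # step 720: flip
            entry["ref_count"] += 1                  # step 722
            return entry["mpn"]                      # step 750

        self.io_reads += 1                           # step 730
        private = next(self._mpns)                   # step 732: read-write page
        self.pages[private] = bytearray(self.disk[target])
        self.writable.add(private)
        self.mapping[(vm, gppn)] = private

        copy = next(self._mpns)                      # step 733: read-only copy
        self.pages[copy] = bytearray(self.pages[private])
        self.table[target] = {"mpn": copy, "ref_count": 1}  # step 734
        self.hints.append((vm, gppn))                # step 736
        return private
```

Because the first requester keeps a private read-write page, an early write by that VM modifies only its own page and leaves the tracked copy intact for later requesters.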

In sum, a technique for shortcutting input/output operations for commonly requested blocks of data is disclosed. A page tracking table maintains associations between disk blocks and machine page numbers. Disk blocks that are likely to be shared are tracked in the page tracking table, while others need not be tracked. When a request is received for a tracked disk block, the data is presented to a requesting VM by remapping a page allocated within the VM for the requested block. In this way, the outcome of the input/output request is achieved transparently with respect to the requesting VM. Disk blocks are deemed likely to be shared based on any technically feasible criteria, such as a disk volume identifier or a disk block reference count.

One advantage of the present invention is that input/output operations may be reduced transparently with respect to guest VMs. One additional advantage is that only a small portion of additional machine memory is needed because additional instances of disk block data are minimized through sharing.

In the embodiments of the invention described herein, “virtual” in the context of a virtualized system means guest virtual, as in GVPN, or guest physical, as in GPPN, and in the context of a non-virtualized system, means just virtual, as in VPN.

It should be recognized that various modifications and changes may be made to the specific embodiments described herein without departing from the broader spirit and scope of the invention as set forth in the appended claims.

The various embodiments described herein may employ various computer-implemented operations involving data stored in computer systems. For example, these operations may require physical manipulation of physical quantities. Usually, though not necessarily, these quantities may take the form of electrical or magnetic signals where they, or representations of them, are capable of being stored, transferred, combined, compared, or otherwise manipulated. Further, such manipulations are often referred to in terms such as producing, identifying, determining, or comparing. Any operations described herein that form part of one or more embodiments of the invention may be useful machine operations. In addition, one or more embodiments of the invention also relate to a device or an apparatus for performing these operations. The apparatus may be specially constructed for specific required purposes, or it may be a general purpose computer selectively activated or configured by a computer program stored in the computer. In particular, various general purpose machines may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations.

The various embodiments described herein may be practiced with other computer system configurations including hand-held devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.

One or more embodiments of the present invention may be implemented as one or more computer programs or as one or more computer program modules embodied in one or more computer readable media. The term computer readable medium refers to any data storage device that can store data which can thereafter be input to a computer system. Computer readable media may be based on any existing or subsequently developed technology for embodying computer programs in a manner that enables them to be read by a computer. Examples of a computer readable medium include a hard drive, network attached storage (NAS), read-only memory, random-access memory (e.g., a flash memory device), a CD (Compact Disc), a CD-ROM, a CD-R, or a CD-RW, a DVD (Digital Versatile Disc), a magnetic tape, and other optical and non-optical data storage devices. The computer readable medium can also be distributed over a network coupled computer system so that the computer readable code is stored and executed in a distributed fashion.

Although one or more embodiments of the present invention have been described in some detail for clarity of understanding, it will be apparent that certain changes and modifications may be made within the scope of the claims. Accordingly, the described embodiments are to be considered as illustrative and not restrictive, and the scope of the claims is not to be limited to details given herein, but may be modified within the scope and equivalents of the claims. In the claims, elements and/or steps do not imply any particular order of operation, unless explicitly stated in the claims.

Plural instances may be provided for components, operations or structures described herein as a single instance. Finally, boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the appended claim(s).

We claim:
 1. A method of processing read I/O requests in a computer system having applications running therein, the applications including virtual memory pages mapped to machine memory pages, the method comprising: maintaining a tracking data structure for a first group of machine memory pages, each having at least two virtual memory pages mapped thereto, and a second group of machine memory pages, each having only one virtual memory page mapped thereto, the tracking data structure indicating, for each of said machine memory pages, a corresponding location in a storage volume from which its contents were read or are being read; receiving a read request; identifying a first location of the storage volume indicated in the read request; calculating, using the tracking data structure, a likelihood that the first location of the storage volume will be shared; determining that the likelihood exceeds a threshold value; based on the determination that the likelihood exceeds the threshold value; determining, using the tracking data structure, that a machine memory page in the first group or the second group contains or will contain data stored in the first location of the storage volume indicated in the read request; based on the determination that a machine memory page in the first group or the second group contains or will contain data stored in a location of the storage volume indicated in the read request, determining not to process the read request; and based on the determination to not process the read request, mapping a virtual memory page associated with the read request to the machine memory page; and in response to a first memory write request, copying the contents of a machine memory page in the first group to a first machine memory page, and mapping a virtual memory page associated with the first memory write request to the first machine memory page; and in response to a second memory write request, copying the contents of a machine memory page in the second group to a second machine memory page, and mapping a virtual memory page associated with the second memory write request to the second machine memory page.
 2. The method of claim 1, wherein the applications are instances of virtual machines and the virtual memory pages are guest physical memory pages.
 3. The method of claim 1, wherein the tracking data structure indicates, for each of said machine memory pages, a reference count indicating how many virtual memory pages are mapped to the machine memory page, and, in response to said read request, the reference count associated with the machine memory page in the first group or the second group is incremented.
 4. The method of claim 3, wherein, in response to said first memory write request, the reference count associated with the machine memory page in the first group is decremented, and, in response to said second memory write request, the reference count associated with the machine memory page in the second group is decremented.
 5. The method of claim 4, further comprising: updating the tracking data structure by deleting one or more of entries of the tracking data structure that correspond to the machine memory pages that have no virtual memory pages mapped thereto.
 6. The method of claim 1, wherein at least one of the machine memory pages in the first and second groups is a target memory location of a read request issued to the storage volume and completed.
 7. The method of claim 1, wherein at least one of the machine memory pages in the first and second groups is a target memory location of a read request issued to the storage volume and currently in process.
 8. The method of claim 1, wherein calculating a likelihood that the first location of the storage volume will be shared is based on a location of the first location of the storage volume.
 9. The method of claim 1, wherein calculating a likelihood that the first location of the storage volume will be shared is based on a number of times the first location of storage volume has been requested.
 10. A method of processing read I/O requests in a computer system having applications running therein, the applications including virtual memory pages mapped to machine memory pages, the method comprising: maintaining a tracking data structure for a first group of machine memory pages, each having at least two virtual memory pages mapped thereto, a second group of machine memory pages, each having only one virtual memory page mapped thereto, and a third group of machine memory pages, each having no virtual memory pages mapped thereto, the tracking data structure indicating, for each of said machine memory pages, a corresponding location in a storage volume from which its contents were read or are being read, and a reference count indicating how many virtual memory pages are mapped to the machine memory page; receiving a read request; identifying a first location of the storage volume indicated in the read request; calculating, using the tracking data structure, a likelihood that the first location of the storage volume will be shared; determining that the likelihood exceeds a threshold value; and based on the determination that the likelihood exceeds the threshold value, mapping a virtual memory page associated with the read request to a machine memory page in one of the first, second, and third groups, and incrementing the reference count associated with the machine memory page.
 11. The method of claim 10, wherein at least one of the machine memory pages in the first, second, and third groups is a target memory location of a read request issued to the storage volume and completed.
 12. The method of claim 10, wherein at least one of the machine memory pages in the first, second, and third groups is a target memory location of a read request issued to the storage volume and currently in process.
 13. The method of claim 10, further comprising: updating the tracking data structure to delete one or more entries associated with the machine memory pages in the third group.
 14. The method of claim 10, wherein, in response to a memory write request, copying the contents of a machine memory page in one of the first and second groups to a new machine memory page, and mapping a virtual memory page associated with the memory write request to the new machine memory page.
 15. A method of processing read I/O requests in a computer system having virtual machines running therein, the virtual machines including guest virtual memory pages mapped to machine memory pages via guest physical memory pages, the method comprising: maintaining a tracking data structure for a set of machine memory pages, the tracking data structure indicating, for each of said machine memory pages, a corresponding location in a storage volume from which its contents were read or are being read, a reference count indicating how many virtual memory pages are mapped to the machine memory page, and a pending status flag indicating whether or not a read from the storage volume is pending; receiving a read request; identifying a first location of the storage volume indicated in the read request; calculating, using the tracking data structure, a likelihood that the first location of the storage volume will be shared; determining that the likelihood exceeds a threshold value; and based on the determination that the likelihood exceeds the threshold value: determining, using the tracking data structure, if one of the machine memory pages being tracked contains or will contain data stored in a location of the storage volume indicated in the read request; based on the determination that one of the machine memory pages being tracked contains or will contain data stored in a location of the storage volume indicated in the read request, determining not to process the read request; and based on the determination to not process the read request, mapping a guest physical memory page associated with the read request to the machine memory page; and if there is no machine memory page being tracked that contains or will contain data stored in a location of the storage volume indicated in the read request, issuing a request to the storage volume for the data stored in the first location of the storage volume indicated in the read request, mapping a guest physical memory page associated with the read request to the machine memory page in which the requested data will be stored, and adding a new entry to a data structure containing page sharing hints, the new entry identifying the guest physical memory page associated with the read request and the virtual machine associated with the guest physical memory page.
 16. The method of claim 15, further comprising: adding a new entry to the tracking data structure corresponding to the machine memory page in which the requested data will be stored, the new entry including the location of the storage volume indicated in the read request, a reference count indicating that one guest physical memory page is mapped to the machine memory page, and a pending status flag that is set to indicate whether a read from the storage volume is pending or has completed.
 17. A computer system comprising: a host platform for virtual machines; and a storage volume connected thereto, the host platform for virtual machines including one or more processors and machine memory, the machine memory having stored therein a tracking data structure for a first group of machine memory pages, each having at least two virtual memory pages mapped thereto, a second group of machine memory pages, each having only one virtual memory page mapped thereto, and a third group of machine memory pages, each having no virtual memory pages mapped thereto, the tracking data structure indicating, for each of said machine memory pages, a corresponding location in a storage volume from which its contents were read or are being read, a reference count indicating how many virtual memory pages are mapped to the machine memory page, and a pending status flag indicating whether or not a read from the storage volume is pending, wherein a read request issued by any of the virtual machines is conditionally issued to the storage volume based on the tracking data structure; and wherein the one or more processors are programmed to: receive a read request; identify a first location of the storage volume indicated in the read request; calculate, using the tracking data structure, a likelihood that the first location of the storage volume will be shared; determine that the likelihood exceeds a threshold value; and based on the determination that the likelihood exceeds the threshold value, map a virtual memory page associated with the read request to a machine memory page in one of the first, second, and third groups.
 18. The computer system of claim 17, wherein the processors are programmed to process read requests issued by the virtual machines and, prior to issuing a read request, to check the tracking data structure to determine whether there exists a machine memory page that contains or will contain data from a location in the storage volume that is the same as indicated in the read request.
 19. The computer system of claim 17, wherein the processors are programmed to map a virtual memory page to a machine memory page and increment the reference count associated with the machine memory page if it is determined that the machine memory page contains or will contain data from a location in the storage volume that is the same as indicated in the read request.
 20. The computer system of claim 19, wherein the processors are further programmed to copy the contents of a shared machine memory page to a new machine memory page in response to a request to write to the shared machine memory page and decrement the reference count associated with the shared machine memory page.
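
For illustration only, the tracking data structure and the conditional read path recited in the claims above may be sketched as follows. This is a minimal, non-limiting sketch under stated assumptions and is not the claimed implementation: the names `TrackedPage`, `ShortcutReader`, `read`, `complete`, and `write` are hypothetical, page contents are not modeled, and the likelihood calculation and threshold test of claims 15 and 17 are omitted because the claims do not specify how the likelihood is computed.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TrackedPage:
    """One tracking entry for a machine memory page (hypothetical layout)."""
    location: int          # location in the storage volume the contents came from
    ref_count: int = 0     # number of guest pages currently mapped to this page
    pending: bool = True   # True while the read from the storage volume is in flight


class ShortcutReader:
    """Hypothetical read path: issue a storage read only when no tracked page
    already contains, or will contain, the requested location."""

    def __init__(self) -> None:
        self.pages: Dict[int, TrackedPage] = {}          # machine page -> entry
        self.by_location: Dict[int, int] = {}            # storage location -> machine page
        self.mappings: Dict[Tuple[str, int], int] = {}   # (vm, guest page) -> machine page
        self.issued_reads: List[int] = []                # locations actually read
        self._next_mpn = 0

    def _alloc(self) -> int:
        mpn = self._next_mpn
        self._next_mpn += 1
        return mpn

    def read(self, vm: str, gppn: int, location: int) -> bool:
        """Map (vm, gppn) to a page for `location`.

        Returns True if a request was issued to the storage volume,
        False if the read was satisfied by an existing or pending page.
        """
        mpn = self.by_location.get(location)
        if mpn is not None:
            # A tracked page contains or will contain the data: share it.
            self.pages[mpn].ref_count += 1
            self.mappings[(vm, gppn)] = mpn
            return False
        # No tracked page: issue the read and add a new tracking entry.
        mpn = self._alloc()
        self.pages[mpn] = TrackedPage(location=location, ref_count=1, pending=True)
        self.by_location[location] = mpn
        self.mappings[(vm, gppn)] = mpn
        self.issued_reads.append(location)
        return True

    def complete(self, location: int) -> None:
        """Clear the pending flag once the storage read has finished."""
        self.pages[self.by_location[location]].pending = False

    def write(self, vm: str, gppn: int) -> int:
        """Copy-on-write: remap the writer to a new, untracked machine page
        and decrement the reference count of the page it was mapped to."""
        old = self.mappings[(vm, gppn)]
        new = self._alloc()          # the copy of the contents is not modeled here
        self.pages[old].ref_count -= 1
        self.mappings[(vm, gppn)] = new
        return new
```

In this sketch, a second read of the same storage location by another virtual machine returns `False` and raises the reference count of the shared page to two, whether or not the first read has completed. A later write by either virtual machine remaps only the writer and lowers the count to one, and the original page remains tracked even if its count later reaches zero.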